library(bayesrules)
library(dplyr)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)
library(ggpubr)
options(scipen = 99)

Bayesian Modeling Assignment
Penguin Body Mass Prediction & Fake News Detection
Introduction: Two Bayesian Modeling Challenges
In this assignment, you'll apply Bayesian statistical methods to two different problems:
- Part A: Penguin Body Mass Prediction - Build and compare multiple Normal regression models
- Part B: Fake News Detection - Build a logistic regression classifier
Both parts will help you practice:
- Setting appropriate priors
- Simulating Bayesian models with stan_glm()
- Evaluating model quality
- Interpreting posterior distributions
- Making predictions
- Complete ALL challenge sections in both parts
- Provide code AND written interpretations
- Reference the class examples when setting priors
- Answer all reflection questions
- Render to PDF and submit to Canvas
Due Date: [11/6/25]
Load Libraries
PART A: Penguin Body Mass Prediction
Introduction to the Penguin Dataset
The penguins_bayes dataset contains measurements of 344 penguins from three species in Antarctica. Our goal is to predict penguin body mass using various physical measurements.
Load and Explore the Data
# Load the penguin data
data(penguins_bayes)
# Clean the data - keep only complete cases for our variables of interest
penguins_complete <- penguins_bayes %>%
select(flipper_length_mm, body_mass_g, species,
bill_length_mm, bill_depth_mm) %>%
na.omit()
# Check the data
glimpse(penguins_complete)

Rows: 342
Columns: 5
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
$ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
- body_mass_g: Body mass in grams (our outcome variable)
- flipper_length_mm: Flipper length in millimeters
- bill_length_mm: Bill (beak) length in millimeters
- bill_depth_mm: Bill depth in millimeters
- species: Adelie, Chinstrap, or Gentoo
CHALLENGE 1: Exploratory Data Analysis
Before building models, explore the relationships between body mass and potential predictors.
Create the following visualizations:
- Scatter plot: body_mass_g vs flipper_length_mm
  - Add a smooth trend line using geom_smooth()
  - Add correlation statistics using stat_cor() from ggpubr
- Box plot: body_mass_g by species
  - Use geom_boxplot() or geom_violin()
  - Color by species
- Scatter plot: body_mass_g vs flipper_length_mm colored by species
  - This shows if the relationship differs by species
Hints & Resources:
- Review the bike sharing example from class (temp_feel vs rides)
- Use ggplot() with appropriate geoms
- stat_cor() from ggpubr adds correlation coefficients automatically
# Plot 1: body_mass_g vs flipper_length_mm with correlation
ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(alpha = 0.7, color = "steelblue") +
geom_smooth(method = "lm", color = "darkred", se = TRUE) +
stat_cor(method = "pearson", label.x = 180, label.y = 6000) +
labs(
title = "Body Mass vs Flipper Length",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
) +
theme_minimal()

# Plot 2: body_mass_g by species
ggplot(penguins_complete, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(alpha = 0.7, outlier.color = "black") +
labs(
title = "Body Mass Distribution by Species",
x = "Species",
y = "Body Mass (g)"
) +
theme_minimal() +
theme(legend.position = "none")

# Plot 3: body_mass_g vs flipper_length_mm colored by species
ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Body Mass vs Flipper Length by Species",
x = "Flipper Length (mm)",
y = "Body Mass (g)",
color = "Species"
) +
theme_minimal()

Based on your visualizations, answer:
- What is the correlation between flipper length and body mass? They have a strong positive correlation: as one gets bigger, the other does as well.
- Which species tends to be heaviest? Lightest? The heaviest are the Gentoo penguins, while the lightest are the Chinstrap penguins.
- Does the relationship between flipper length and body mass appear consistent across species? Yes, it does, because the species are grouped cleanly and we don't have major outliers.
- Which predictors do you think will be most useful? Flipper length and species, since flipper length tracks overall body size and species captures the remaining group differences.
CHALLENGE 2: Build Model 1 (Simple Regression)
Build a Bayesian normal regression model predicting body mass from flipper length only.
Model Formula: body_mass_g ~ flipper_length_mm
Setting Your Priors:
Think about reasonable values before seeing the data:
- Intercept Prior (prior_intercept):
  - When flipper length = 0mm (not realistic, but mathematically necessary), what would body mass be?
  - Penguins typically weigh 3000-5000g, so let's center around 0g (since this is an extrapolation) with large uncertainty
  - Try: normal(0, 2500)
- Slope Prior (prior):
  - For every 1mm increase in flipper length, how much does body mass increase?
  - A penguin with longer flippers is probably heavier - maybe 20-50g per mm?
  - Try: normal(50, 25) (centered at 50g/mm, but very uncertain)
- Sigma Prior (prior_aux):
  - How much do individual penguins vary from the line?
  - Body mass varies quite a bit - maybe 500-1000g?
  - Try: exponential(1/500) (remember: rate = 1/mean)
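As a quick side check (not part of the assignment), the "rate = 1/mean" fact is easy to verify by simulation: drawing from an exponential distribution with rate 1/500 should give draws averaging about 500.

```r
# Simulation check: exponential(rate = 1/500) has mean 500
set.seed(84735)
sigma_draws <- rexp(100000, rate = 1/500)
mean(sigma_draws)  # approximately 500
```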
Hints & Resources:
- Review the bike model from class: bike_model <- stan_glm(rides ~ temp_feel, ...)
- Use family = gaussian for normal regression
- Set chains = 4, iter = 5000*2, seed = 84735 for reproducibility
- Name your model penguin_model_1
# Build Bayesian regression model
penguin_model_1 <- stan_glm(
body_mass_g ~ flipper_length_mm,
data = penguins_complete,
family = gaussian(),
prior_intercept = normal(0, 2500), # intercept prior
prior = normal(50, 25), # slope prior
prior_aux = exponential(1 / 500), # sigma prior (1/mean)
chains = 4,
iter = 5000 * 2,
seed = 84735
)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.001537 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 15.37 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.663 seconds (Warm-up)
Chain 1: 0.494 seconds (Sampling)
Chain 1: 1.157 seconds (Total)
Chain 1:
# Summarize model
summary(penguin_model_1)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: body_mass_g ~ flipper_length_mm
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 342
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) -5780.2 307.8 -6176.3 -5781.3 -5386.0
flipper_length_mm 49.7 1.5 47.7 49.7 51.6
sigma 395.2 15.2 376.0 394.8 414.9
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 4201.2 30.7 4162.4 4201.2 4240.7
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 2.1 1.0 20602
flipper_length_mm 0.0 1.0 20625
sigma 0.1 1.0 20263
mean_PPD 0.2 1.0 20349
log-posterior 0.0 1.0 9224
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# Build penguin_model_1 predicting body_mass_g from flipper_length_mm
# penguin_model_1 <- stan_glm(...)

After your model runs, print it to see:
- Intercept: Where the line crosses the y-axis (at flipper = 0)
- flipper_length_mm: The slope - the change in body mass per 1mm increase in flipper length
- sigma: Residual standard deviation - the typical prediction error
- MAD_SD: Uncertainty in each parameter estimate
Look for:
- Is the slope positive (heavier penguins have longer flippers)?
- What's the typical prediction error (sigma)?
Yes, the slope is positive, confirming that heavier penguins have longer flippers. The typical prediction error is about 395g, which is pretty good since penguins range between 3000-6000g. The intercept means that at flipper = 0mm the predicted body mass would be -5780g, which isn't realistic but is just the mathematical baseline of the line.
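As a sanity check on these posterior means (a side calculation, not required by the assignment), plugging a realistic flipper length of 200mm into the fitted line gives a plausible body mass:

```r
# Point prediction from the posterior means for a 200mm flipper
-5780.2 + 49.7 * 200  # about 4160 g, well within the 3000-6000g range
```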
CHALLENGE 3: Build Model 2 (Species Only)
Build a model predicting body mass from species only (no physical measurements).
Model Formula: body_mass_g ~ species
Setting Your Priors:
- Intercept Prior:
  - This will be the body mass for the reference species (Adelie)
  - Adelie penguins are smaller, around 3500-4000g
  - Try: normal(3750, 500)
- Slope Priors:
  - These represent differences from Adelie for Chinstrap and Gentoo
  - Gentoo are larger (maybe +1000g?), Chinstrap similar to Adelie
  - Try: normal(0, 1000) (centered at no difference, but allows large variation)
- Sigma Prior:
  - Same reasoning as Model 1
  - Try: exponential(1/500)
penguin_model_2 <- stan_glm(
body_mass_g ~ species,
data = penguins_complete,
family = gaussian(),
prior_intercept = normal(3750, 500),
prior = normal(0, 1000),
prior_aux = exponential(1 / 500),
chains = 4,
iter = 5000 * 2,
seed = 84735
)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 2.7e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.547 seconds (Warm-up)
Chain 1: 0.51 seconds (Sampling)
Chain 1: 1.057 seconds (Total)
Chain 1:
# Summarize model
summary(penguin_model_2)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: body_mass_g ~ species
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 342
predictors: 3
Estimates:
mean sd 10% 50% 90%
(Intercept) 3701.8 38.0 3653.2 3702.1 3750.3
speciesChinstrap 29.8 67.9 -56.8 29.9 116.7
speciesGentoo 1370.7 56.5 1298.6 1371.1 1442.8
sigma 463.3 18.1 440.5 462.8 487.0
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 4200.7 35.6 4155.1 4200.6 4247.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.3 1.0 19941
speciesChinstrap 0.5 1.0 19861
speciesGentoo 0.4 1.0 19325
sigma 0.1 1.0 21960
mean_PPD 0.2 1.0 21622
log-posterior 0.0 1.0 8938
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# Build penguin_model_2 predicting body_mass_g from species
# penguin_model_2 <- stan_glm(...)

When you include species in the model:
- One species becomes the "reference" (usually alphabetically first = Adelie)
- The intercept = mean body mass for Adelie
- Other coefficients = difference from Adelie
- Example: If speciesChinstrap = 32, Chinstraps are 32g heavier than Adelies (on average)
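This reference-level coding can be seen directly with base R's model.matrix(); a minimal sketch (a side illustration, not part of the assignment):

```r
# Adelie (alphabetically first) becomes the reference level: it gets no
# indicator column of its own and is absorbed into the intercept
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))
model.matrix(~ species)
# Columns: (Intercept), speciesChinstrap, speciesGentoo
```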
CHALLENGE 4: Build Model 3 (Flipper + Species)
Build a model using BOTH flipper length and species as predictors.
Model Formula: body_mass_g ~ flipper_length_mm + species
Setting Your Priors:
- Similar to Models 1 and 2, but now we're combining information
- Intercept: normal(0, 2500) (body mass when flipper=0 for the reference species)
- Slopes: normal(0, 1000, autoscale = TRUE) (let rstanarm auto-scale based on data units)
- Sigma: exponential(1/500)
Hint: Use autoscale = TRUE in the normal() prior - this automatically adjusts the scale based on your data units, which is helpful when predictors are on different scales.
penguin_model_3 <- stan_glm(
body_mass_g ~ flipper_length_mm + species,
data = penguins_complete,
family = gaussian(),
prior_intercept = normal(0, 2500), # body mass when flipper = 0 (for Adelie)
prior = normal(0, 1000, autoscale = TRUE), # flexible priors scaled to predictor range
prior_aux = exponential(1 / 500), # residual SD prior
chains = 4,
iter = 5000 * 2,
seed = 84735
)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 1.9e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 12.129 seconds (Warm-up)
Chain 1: 6.395 seconds (Sampling)
Chain 1: 18.524 seconds (Total)
Chain 1:
# Summarize model
summary(penguin_model_3)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: body_mass_g ~ flipper_length_mm + species
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 342
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) -4040.5 583.9 -4791.3 -4038.5 -3289.9
flipper_length_mm 40.7 3.1 36.8 40.7 44.7
speciesChinstrap -206.3 58.5 -281.2 -206.3 -130.8
speciesGentoo 266.3 95.2 144.6 266.7 387.1
sigma 376.6 14.4 358.1 376.3 395.3
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 4201.2 28.9 4164.1 4201.2 4238.5
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 4.7 1.0 15539
flipper_length_mm 0.0 1.0 15434
speciesChinstrap 0.4 1.0 19936
speciesGentoo 0.8 1.0 15225
sigma 0.2 1.0 5228
mean_PPD 0.3 1.0 8085
log-posterior 0.0 1.0 5603
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# Build penguin_model_3 with both flipper_length_mm and species
# penguin_model_3 <- stan_glm(...)

When you include both flipper length AND species:
- Does species still matter after accounting for flipper size?
- Or is species just a proxy for "big flippers"?
- Compare the species coefficients in Model 2 vs Model 3!
Comparing the coefficients, species matters in Model 2 largely because it correlates with flipper length: once flipper length is included in Model 3, the species effects shrink considerably, so species contributes only a little on its own.
CHALLENGE 5: Build Model 4
Build a model using multiple physical measurements (no species).
Model Formula: body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm
Setting Your Priors:
- Use the autoscaling approach since we have multiple predictors on different scales
- Intercept: normal(0, 2500)
- Slopes: normal(0, 1000, autoscale = TRUE)
- Sigma: exponential(1/500)
penguin_model_4 <- stan_glm(
body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
data = penguins_complete,
family = gaussian(),
prior_intercept = normal(0, 2500),
prior = normal(0, 1000, autoscale = TRUE),
prior_aux = exponential(1/500),
seed = 1234
)
SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 1.8e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 5.277 seconds (Warm-up)
Chain 1: 0.477 seconds (Sampling)
Chain 1: 5.754 seconds (Total)
Chain 1:
summary(penguin_model_4)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 342
predictors: 4
Estimates:
mean sd 10% 50% 90%
(Intercept) -6423.0 569.8 -7156.8 -6413.3 -5692.1
flipper_length_mm 50.2 2.5 47.0 50.2 53.5
bill_length_mm 4.3 5.3 -2.6 4.3 10.9
bill_depth_mm 20.1 13.8 3.2 19.8 37.8
sigma 394.3 15.8 374.2 393.7 415.1
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 4200.5 30.3 4163.0 4200.5 4239.6
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 9.7 1.0 3428
flipper_length_mm 0.0 1.0 3255
bill_length_mm 0.1 1.0 3709
bill_depth_mm 0.2 1.0 3659
sigma 0.5 1.0 844
mean_PPD 0.8 1.0 1480
log-posterior 0.1 1.0 1027
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# Build penguin_model_4 with flipper, bill length, and bill depth
# penguin_model_4 <- stan_glm(...)

CHALLENGE 6: Posterior Predictive Checks
Use pp_check() to visualize how well each modelβs predictions match the actual data.
What is a pp_check?
- Generates predicted datasets from your model
- Overlays them with the actual data
- Good fit = predicted data (light blue) looks similar to actual data (dark blue)
Create pp_check plots for all 4 models
Hints & Resources:
- Simply use: pp_check(your_model_name)
- The plot shows:
  - Dark line = actual data distribution
  - Light lines = simulated predictions from posterior
- Want them to overlap!
# pp_check for Model 1
pp_check(penguin_model_1)

# pp_check for Model 2
pp_check(penguin_model_2)

# pp_check for Model 3
pp_check(penguin_model_3)

# pp_check for Model 4
pp_check(penguin_model_4)

For each model, consider:
1. Do the simulated predictions (light blue) capture the shape of actual data (dark blue)?
2. Are the predictions too narrow? Too wide? Just right?
3. Which model's predictions look most realistic?
4. Do any models miss important features of the data?
The simulated predictions come very close to matching the actual data. They are slightly wide but otherwise track the observed distribution well. Model 3 looks best because its simulations follow the observed density most closely. Models 1 and 2 each use only one predictor, and Model 4 uses only the physical measurements, so it misses the species information.
π― CHALLENGE 7: Cross-Validation Comparison
Use 10-fold cross-validation to compare the predictive accuracy of all 4 models.
What is Cross-Validation? - Splits data into 10 parts (folds) - Trains model on 9 parts, tests on 1 part - Repeats 10 times so each part gets tested - Measures: how well does the model predict NEW data it hasnβt seen?
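The splitting step described above can be sketched directly; this is only an illustration of how a 10-fold partition is formed, assuming the 342 complete cases from Part A:

```r
# Assign each of the 342 complete penguin cases to one of 10 folds
set.seed(84735)
folds <- sample(rep(1:10, length.out = 342))

# Each fold holds roughly a tenth of the data (34 or 35 penguins)
table(folds)
```

Each fold then takes a turn as the test set while the other nine are used for fitting.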
Key Metrics: - mae (Mean Absolute Error): Average prediction error in grams - lower is better - mae_scaled: MAE relative to outcome variability - lower is better
- elpd (Expected Log Predictive Density): Overall predictive quality - higher is better
Steps: 1. Run prediction_summary_cv(model, data, k = 10) for each model 2. Compare the mae values - which model has the lowest prediction error?
Hints & Resources: - Use the same dataset for all: data = penguins_complete - Set k = 10 for 10-fold CV - This may take 2-3 minutes per model - be patient!
# Cross-validation for Model 1
prediction_summary_cv(model = penguin_model_1, data = penguins_complete, k = 10)$folds
fold mae mae_scaled within_50 within_95
1 1 266.5947 0.6713318 0.4857143 0.9428571
2 2 300.5426 0.7729209 0.4411765 0.8823529
3 3 256.8005 0.6463619 0.5294118 0.9411765
4 4 290.7628 0.7408363 0.4411765 0.9411765
5 5 196.9595 0.4936217 0.6470588 0.9705882
6 6 300.9390 0.7715755 0.4117647 0.8529412
7 7 376.3751 0.9622995 0.4411765 0.9411765
8 8 285.6118 0.7191567 0.4705882 0.9705882
9 9 145.8079 0.3615771 0.6470588 0.9705882
10 10 212.6738 0.5272587 0.6285714 1.0000000
$cv
mae mae_scaled within_50 within_95
1 263.3068 0.666694 0.5143697 0.9413445
# Cross-validation for Model 2
prediction_summary_cv(model = penguin_model_2, data = penguins_complete, k = 10)$folds
fold mae mae_scaled within_50 within_95
1 1 374.3480 0.8138041 0.4571429 0.9714286
2 2 400.1078 0.8668009 0.4117647 0.9411765
3 3 348.2780 0.7498007 0.4411765 0.9705882
4 4 283.5209 0.6086689 0.5588235 0.9411765
5 5 397.6045 0.8593968 0.3823529 0.9705882
6 6 349.7943 0.7430705 0.4411765 1.0000000
7 7 323.9969 0.6925632 0.4705882 0.9411765
8 8 371.9769 0.8082410 0.4705882 0.9117647
9 9 292.7447 0.6214921 0.5882353 0.9411765
10 10 306.4385 0.6422179 0.5428571 1.0000000
$cv
mae mae_scaled within_50 within_95
1 344.881 0.7406056 0.4764706 0.9589076
# Cross-validation for Model 3
prediction_summary_cv(model = penguin_model_3, data = penguins_complete, k = 10)$folds
fold mae mae_scaled within_50 within_95
1 1 251.5629 0.6644328 0.5142857 0.9142857
2 2 300.2981 0.7870111 0.4705882 1.0000000
3 3 227.6896 0.5984127 0.5294118 0.9705882
4 4 267.6877 0.7118303 0.5000000 0.9117647
5 5 319.8091 0.8510643 0.4411765 0.9705882
6 6 175.6132 0.4582463 0.5882353 0.9411765
7 7 267.6800 0.7135569 0.5000000 0.9117647
8 8 265.9879 0.6980998 0.5000000 0.9705882
9 9 213.6924 0.5595027 0.6176471 0.9411765
10 10 268.8410 0.7143473 0.4571429 0.9428571
$cv
mae mae_scaled within_50 within_95
1 255.8862 0.6756504 0.5118487 0.947479
# Cross-validation for Model 4
prediction_summary_cv(model = penguin_model_4, data = penguins_complete, k = 10)$folds
fold mae mae_scaled within_50 within_95
1 1 348.5807 0.8985274 0.4285714 0.8857143
2 2 217.5245 0.5377544 0.6176471 0.9705882
3 3 369.8549 0.9563926 0.4117647 0.9117647
4 4 236.2915 0.4814529 0.6176471 0.9705882
5 5 212.8318 0.5429361 0.5882353 0.9411765
6 6 242.5776 0.5989023 0.5588235 0.9705882
7 7 224.1677 0.5683188 0.5294118 0.9411765
8 8 264.8185 0.6599565 0.5000000 0.9705882
9 9 266.4683 0.6710823 0.5000000 0.9411765
10 10 355.7692 0.5010463 0.5428571 0.9428571
$cv
mae mae_scaled within_50 within_95
1 273.8885 0.641637 0.5294958 0.9446218
Fill in the comparison table below:
| Model | Predictors | MAE (grams) | Interpretation |
|---|---|---|---|
| 1 | flipper only | 263.31 | Flipper length alone predicts body mass quite well; most of the variation is captured by flipper size. |
| 2 | species only | 344.88 | Species alone is the least accurate, giving only a rough estimate based on average size differences. |
| 3 | flipper + species | 255.89 | Combining flipper length and species gives a slight improvement over flipper alone; species adds a minor adjustment. |
| 4 | all measurements | 273.89 | Using all the physical measurements predicts decently but is still slightly worse than Model 3. |
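As a quick check, the cross-validated MAE values (transcribed here from the $cv summaries above) can be compared programmatically:

```r
# Cross-validated MAE in grams, copied from the $cv output for each model
cv_mae <- c(model_1 = 263.31, model_2 = 344.88,
            model_3 = 255.89, model_4 = 273.89)

# Model with the lowest prediction error
names(which.min(cv_mae))

# All models, ordered from best to worst
sort(cv_mae)
```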
Which model would you choose and why? Consider: - Predictive accuracy (lowest MAE) - Model simplicity (fewer predictors) - Interpretability - The trade-off between complexity and performance
I would choose Model 3, since it has the lowest MAE and uses only two easily measured predictors.
π Part A Reflection Questions
Model Selection: Which model performed best in cross-validation? Was it the most complex model?
Model 3 had the best performance, with the lowest MAE. It is not the most complex model, since it uses only two predictors.
Species vs Measurements: Does knowing the species add information beyond physical measurements? How can you tell?
Yes, but only a little. The species coefficients shrink once flipper length is included (compare Model 3 with Model 2), which means flipper length does most of the explaining.
Practical Use: If you were a penguin researcher with limited measurement tools, which model would you use and why?
I would use Model 1, because flipper length alone predicts body mass well. It is simple and fast, since it needs only one measurement, and its MAE is still decent.
Prior Sensitivity: How might different priors have affected your results? Were your priors informative or vague?
The priors were mostly vague and weakly informative for the slopes. They were weak, but they still helped stabilize the sampling. More informative priors would shrink the slope estimates toward the prior expectations, while wider priors would increase posterior uncertainty.
Assumptions: What assumptions does normal regression make? Are they reasonable for penguin body mass?
Normal regression assumes that body mass changes linearly with predictors such as flipper and bill measurements, that the residuals around the predictions are roughly normal, and that each penguin's measurements are independent. Linearity is reasonable for flipper length and body mass, since the two grow together. Independence is more questionable, because penguins from the same colony may be measured together.
π° PART B: Fake News Detection
π Introduction to Fake News Classification
The fake_news dataset contains information about 150 news articles. Our goal is to build a logistic regression classifier to distinguish real news from fake news based on article characteristics.
Load and Explore the Data
# Load the fake news data
data(fake_news)
# Check the structure
glimpse(fake_news)

Rows: 150
Columns: 30
$ title <chr> "Clinton's Exploited Haiti Earthquake βto Steaβ¦
$ text <chr> "0 SHARES Facebook Twitter\n\nBernard Sansaricβ¦
$ url <chr> "http://freedomdaily.com/former-haitian-senateβ¦
$ authors <chr> NA, NA, "Sierra Marlee", "Jack Shafer,Nolan D"β¦
$ type <fct> fake, real, fake, real, fake, real, fake, fakeβ¦
$ title_words <int> 17, 18, 16, 11, 9, 12, 11, 18, 10, 13, 10, 11,β¦
$ text_words <int> 219, 509, 494, 268, 479, 220, 184, 500, 677, 4β¦
$ title_char <int> 110, 95, 96, 60, 54, 66, 86, 104, 66, 81, 59, β¦
$ text_char <int> 1444, 3016, 2881, 1674, 2813, 1351, 1128, 3112β¦
$ title_caps <int> 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 0, 0, 0, 0β¦
$ text_caps <int> 1, 1, 3, 3, 0, 0, 0, 12, 12, 1, 2, 5, 1, 1, 6,β¦
$ title_caps_percent <dbl> 0.000000, 0.000000, 6.250000, 0.000000, 0.0000β¦
$ text_caps_percent <dbl> 0.4566210, 0.1964637, 0.6072874, 1.1194030, 0.β¦
$ title_excl <int> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0β¦
$ text_excl <int> 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0β¦
$ title_excl_percent <dbl> 0.0000000, 0.0000000, 2.0833333, 0.0000000, 0.β¦
$ text_excl_percent <dbl> 0.00000000, 0.00000000, 0.06942034, 0.00000000β¦
$ title_has_excl <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSEβ¦
$ anger <dbl> 4.24, 2.28, 1.18, 4.66, 0.82, 1.29, 2.56, 3.47β¦
$ anticipation <dbl> 2.12, 1.71, 2.16, 1.79, 1.23, 0.43, 2.05, 1.74β¦
$ disgust <dbl> 2.54, 1.90, 0.98, 1.79, 0.41, 1.72, 2.05, 1.35β¦
$ fear <dbl> 3.81, 1.90, 1.57, 4.30, 0.82, 0.43, 5.13, 4.25β¦
$ joy <dbl> 1.27, 1.71, 1.96, 0.36, 1.23, 0.86, 1.54, 1.35β¦
$ sadness <dbl> 4.66, 1.33, 0.78, 1.79, 0.82, 0.86, 2.05, 1.93β¦
$ surprise <dbl> 2.12, 1.14, 1.18, 1.79, 0.82, 0.86, 1.03, 1.35β¦
$ trust <dbl> 2.97, 4.17, 3.73, 2.51, 2.46, 2.16, 5.13, 3.86β¦
$ negative <dbl> 8.47, 4.74, 3.33, 6.09, 2.66, 3.02, 4.10, 4.63β¦
$ positive <dbl> 3.81, 4.93, 5.49, 2.15, 4.30, 2.16, 4.10, 4.25β¦
$ text_syllables <int> 395, 845, 806, 461, 761, 376, 326, 891, 1133, β¦
$ text_syllables_per_word <dbl> 1.803653, 1.660118, 1.631579, 1.720149, 1.5887β¦
- type: "Real" or "Fake" (our outcome variable)
- title_has_excl: Does the title contain an exclamation point? (TRUE/FALSE)
- title_words: Number of words in the article title
- negative: Negative sentiment rating of the article (percent of words with negative sentiment)
π CHALLENGE 8: Exploratory Data Analysis for Classification
Explore how article characteristics differ between real and fake news.
Create the following visualizations:
- Bar chart: Count of Real vs Fake articles - use geom_bar() with fill = type
- Box plots: Compare title_words and negative between Real and Fake - create two separate plots using geom_boxplot()
- Proportions: What percentage of articles with exclamation points are fake? - use count() and mutate() to calculate proportions
Hints & Resources: - Review the weather Perth rain examples from class
# Plot 1: Bar chart of Real vs Fake
ggplot(fake_news, aes(x = type, fill = type)) +
  geom_bar() +
  labs(title = "Count of Real vs Fake Articles",
       x = "Article Type",
       y = "Number of Articles") +
  theme_minimal()

# Plot 2: Box plot of title_words by type
ggplot(fake_news, aes(x = type, y = title_words, fill = type)) +
  geom_boxplot() +
  labs(title = "Distribution of Title Word Count by Article Type",
       x = "Article Type",
       y = "Number of Words in Title") +
  theme_minimal()

# Plot 3: Box plot of negative sentiment by type
ggplot(fake_news, aes(x = type, y = negative, fill = type)) +
  geom_boxplot() +
  labs(title = "Distribution of Negative Sentiment by Article Type",
       x = "Article Type",
       y = "Negative Sentiment Score") +
  theme_minimal()

# Calculate proportion of fake news by exclamation point presence
fake_news %>%
  filter(title_has_excl == TRUE) %>%
  count(type) %>%
  mutate(prop = n / sum(n) * 100)

What proportion of articles in the dataset are fake?
Roughly 40% of the articles are fake.
Do fake news articles use more exclamation points?
Yes - about 88% of the articles whose titles contain an exclamation point are fake.
Do fake and real news differ in title length?
Yes, fake news articles tend to have longer titles.
Is negative sentiment associated with fake news?
Fake news articles are usually more negative.
π― CHALLENGE 9: Build Fake News Classifier
Build a Bayesian logistic regression model to predict whether an article is fake based on the three predictors.
Model Formula: type ~ title_has_excl + title_words + negative
Understanding Logistic Regression Priors:
Remember: In logistic regression, we model log-odds, not probabilities directly!
Prior for Intercept (baseline log-odds when all predictors = 0): - Think: What % of articles are fake when they have no exclamation point, average title length, and neutral sentiment? - From your EDA, roughly 50% are fake overall = 50/50 odds = log-odds of 0 - But we're uncertain, so use: normal(0, 1.5) - plogis(0) = 50% probability - plogis(-1.5) ≈ 18%, plogis(1.5) ≈ 82% (wide range!)
Priors for Slopes: - These represent how much log-odds change per unit increase in predictor - We're uncertain about effect sizes, so use weakly informative priors - Try: normal(0, 1, autoscale = TRUE) - This says: effects could be positive or negative, but probably not extreme
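The log-odds intuition above can be verified numerically with plogis(), which converts log-odds to probabilities:

```r
# A log-odds of 0 corresponds to a 50% chance of REAL news
plogis(0)

# One prior standard deviation away under the normal(0, 1.5) intercept prior
plogis(-1.5)   # about 0.18
plogis(1.5)    # about 0.82
```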
Important: Use family = binomial for logistic regression!
Hints & Resources: - Review the rain_model_1 from class: stan_glm(raintomorrow ~ humidity9am, family = binomial, ...) - Remember: outcome must be a factor or binary (0/1) - Set chains = 4, iter = 5000*2, seed = 84735 - Name your model fake_news_model
Your Code:
fake_news_model <- stan_glm(
type ~ title_has_excl + title_words + negative,
data = fake_news,
family = binomial,
prior_intercept = normal(0, 1.5),
prior = normal(0, 1, autoscale = TRUE),
chains = 4,
iter = 5000 * 2,
seed = 84735
)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 2.8e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.28 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.599 seconds (Warm-up)
Chain 1: 0.677 seconds (Sampling)
Chain 1: 1.276 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 2.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 0.545 seconds (Warm-up)
Chain 2: 0.527 seconds (Sampling)
Chain 2: 1.072 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 4e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.4 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 0.412 seconds (Warm-up)
Chain 3: 0.437 seconds (Sampling)
Chain 3: 0.849 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 2.6e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 0.38 seconds (Warm-up)
Chain 4: 0.439 seconds (Sampling)
Chain 4: 0.819 seconds (Total)
Chain 4:
print(fake_news_model)

stan_glm
family: binomial [logit]
formula: type ~ title_has_excl + title_words + negative
observations: 150
predictors: 4
------
Median MAD_SD
(Intercept) 2.9 0.8
title_has_exclTRUE -2.5 0.8
title_words -0.1 0.1
negative -0.3 0.2
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
# Build logistic regression model for fake news detection
# fake_news_model <- stan_glm(...)

After printing your model: - Intercept: Log-odds of REAL news at baseline (all predictors = 0) - Coefficients: Change in log-odds per unit increase in predictor - Important: The model predicts probability of REAL news (coded as 1) - Positive coefficient = increases odds of REAL news - Negative coefficient = increases odds of FAKE news (decreases odds of real)
π― CHALLENGE 10: Interpret Odds Ratios
Convert the log-odds coefficients to odds ratios to make them interpretable.
What are Odds Ratios? - Odds Ratio (OR) = exp(coefficient) - OR > 1 means predictor increases odds of REAL news - OR < 1 means predictor increases odds of FAKE news (decreases odds of real) - OR = 1 means no effect
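For example, applying exp() to the rounded median coefficients from the printed model summary in this section gives approximately these odds ratios (they differ slightly from the full-precision values because the print() output is rounded):

```r
# Median coefficients as rounded in the print() output
coefs <- c(intercept = 2.9, title_has_exclTRUE = -2.5,
           title_words = -0.1, negative = -0.3)

# Odds ratios: exp() of the log-odds coefficients
round(exp(coefs), 2)
```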
Calculate 80% Credible Intervals:
Use: exp(posterior_interval(fake_news_model, prob = 0.80))
Interpret the results (remember: model predicts REAL news): - For title_has_exclTRUE: If OR < 1, exclamation points are associated with FAKE news - For title_words: How does title length affect the odds of real vs fake? - For negative: How does negative sentiment affect the odds of real vs fake?
Hints & Resources: - Review the rain model interpretation from class - To convert OR to percent change: (OR - 1) * 100 - Example: OR = 0.75 means -25% (reduces odds of REAL news by 25% = increases odds of FAKE) - Example: OR = 1.25 means +25% (increases odds of REAL news by 25%)
coef_est <- coef(fake_news_model)
odds_ratios <- exp(coef_est)
odds_ratios

(Intercept) title_has_exclTRUE title_words negative
18.26494243 0.08522257 0.89448542 0.72902748
exp(posterior_interval(fake_news_model, prob = 0.80))

10% 90%
(Intercept) 7.12361124 49.4424472
title_has_exclTRUE 0.02891118 0.2109089
title_words 0.83154943 0.9616440
negative 0.59926930 0.8812999
# Calculate and display odds ratios with 80% credible intervals
# exp(posterior_interval(...))

In R's logistic regression, the outcome is coded alphabetically: - "Fake" (first alphabetically) = 0 - "Real" (second alphabetically) = 1
This means your model predicts the probability of REAL news, not fake news!
When interpreting odds ratios: - OR > 1 means predictor increases odds of REAL news - OR < 1 means predictor increases odds of FAKE news
For each predictor, write 1-2 sentences interpreting the odds ratio. Remember: OR < 1 means the predictor is associated with FAKE news!
Example interpretation for title_has_excl (OR = 0.03 to 0.21):
"Articles with exclamation points have 79-97% lower odds of being REAL news (80% CI: 0.03, 0.21). This means exclamation points are strongly associated with FAKE news - articles with ! are much more likely to be fake."
To calculate percent change: (OR - 1) × 100 - (0.03 - 1) × 100 = -97% - (0.21 - 1) × 100 = -79%
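The same percent-change conversion can be applied to each predictor's 80% interval (endpoints transcribed, rounded, from the posterior_interval output in this section):

```r
# Rounded 80% credible-interval endpoints for the odds ratios
or_bounds <- rbind(title_has_excl = c(0.03, 0.21),
                   title_words    = c(0.83, 0.96),
                   negative       = c(0.60, 0.88))
colnames(or_bounds) <- c("10%", "90%")

# Percent change in the odds of REAL news: (OR - 1) * 100
round((or_bounds - 1) * 100)
```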
Now write interpretations for:
title_words (OR = 0.83 to 0.96): How does title length relate to fake vs real news?
negative (OR = 0.60 to 0.88): How does negative sentiment relate to fake vs real news?
Summary question: Based on these odds ratios, what characteristics define fake news in this dataset?
- Each additional word in the title is associated with 4 to 17% lower odds of being REAL news (80% CI: 0.83, 0.96). This means that longer titles are slightly more likely to appear in FAKE news articles.
- Higher negative sentiment scores are associated with 12 to 40% lower odds of being REAL news (80% CI: 0.60, 0.88). Articles with more negative language are therefore more likely to be FAKE news.
- Articles with exclamation points in the title are much more likely to be fake.
- Articles with slightly longer titles tend to be fake.
- Articles with more negative sentiment are more likely to be fake.
π― CHALLENGE 11: Classification Performance (Cutoff = 0.5)
Evaluate how well your model classifies articles as real or fake using a 0.5 probability cutoff.
What does this mean? - If the model predicts P(Real) > 0.5, classify as Real - If the model predicts P(Real) ≤ 0.5, classify as Fake
Use: classification_summary(model = fake_news_model, data = fake_news, cutoff = 0.5)
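The cutoff rule is just a threshold on the predicted probability of REAL news; a toy sketch with made-up probabilities:

```r
# Hypothetical posterior predicted probabilities of REAL news
p_real <- c(0.20, 0.45, 0.55, 0.90)

# Classify as real when P(Real) exceeds the 0.5 cutoff
ifelse(p_real > 0.5, "real", "fake")
```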
Key Metrics to Understand:
| Metric | Formula | Interpretation |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall % correct |
| Sensitivity (True Positive Rate) | TP / (TP + FN) | % of real news correctly identified |
| Specificity (True Negative Rate) | TN / (TN + FP) | % of fake news correctly identified |
Where (positive class = REAL news, since "real" is coded as 1): - TP = True Positives (correctly identified real news) - TN = True Negatives (correctly identified fake news)
- FP = False Positives (fake news wrongly called real) - FN = False Negatives (real news wrongly called fake)
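These formulas can be checked by hand against the cutoff = 0.5 confusion matrix reported in this challenge, using the convention that the positive class is REAL news (coded 1):

```r
# Counts from the cutoff = 0.5 confusion matrix (positive class = REAL)
tp <- 84  # real articles classified as real
fn <- 6   # real articles classified as fake
tn <- 26  # fake articles classified as fake
fp <- 34  # fake articles classified as real

# Metrics match the printed accuracy_rates (0.733, 0.933, 0.433)
round(c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
        sensitivity = tp / (tp + fn),
        specificity = tn / (tn + fp)), 3)
```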
classification_summary(
model = fake_news_model,
data = fake_news,
cutoff = 0.5
)$confusion_matrix
y 0 1
fake 26 34
real 6 84
$accuracy_rates
sensitivity 0.9333333
specificity 0.4333333
overall_accuracy 0.7333333
# Evaluate classification at cutoff = 0.5
# classification_summary(...)What is the overall accuracy? Is this good?
About 73% of all articles were correctly classified.
This is not bad, but not perfect - the model is right about 3 out of 4 times overall. Accuracy alone doesn't tell the whole story, though, because the classes are unbalanced and the two error types matter differently.
What is the sensitivity? Are we good at identifying real news?
93.3% of real news articles were correctly identified.
This is excellent - the model is very good at recognizing real news.
What is the specificity? Are we good at catching fake news?
Only about 43% of fake news articles were correctly identified.
This is poor: the model misclassifies 57% of fake news as real, so many fake articles slip through.
Which type of error is more problematic:
False positives (calling fake news βrealβ)?
False negatives (calling real news βfakeβ)?
False positives (calling fake news "real") are more problematic here, since the goal is to catch fake news and each missed article lets misinformation slip through.
π― CHALLENGE 12: Adjusting the Classification Threshold
Explore how changing the cutoff threshold affects classification performance.
The Trade-off: - Lower cutoff (e.g., 0.3): Easier to classify as Real - ✓ Catches more real news (higher sensitivity) - ✗ More false alarms - labels more fake as real (lower specificity)
- Higher cutoff (e.g., 0.7): Harder to classify as Real
- ✗ Misses more real news (lower sensitivity)
- ✓ Fewer false alarms - better at catching fake (higher specificity)
Remember: Since model predicts P(Real), lowering the cutoff makes it EASIER to call something real (and thus HARDER to call it fake).
Try these cutoffs and compare: 1. cutoff = 0.3 (liberal - more willing to call articles βrealβ) 2. cutoff = 0.7 (conservative - more skeptical, flags more as βfakeβ)
classification_summary(
model = fake_news_model,
data = fake_news,
cutoff = 0.3
)$confusion_matrix
y 0 1
fake 17 43
real 2 88
$accuracy_rates
sensitivity 0.9777778
specificity 0.2833333
overall_accuracy 0.7000000
# Evaluate classification at cutoff = 0.7
classification_summary(
model = fake_news_model,
data = fake_news,
cutoff = 0.7
)$confusion_matrix
y 0 1
fake 48 12
real 43 47
$accuracy_rates
sensitivity 0.5222222
specificity 0.8000000
overall_accuracy 0.6333333
Fill in the comparison table:
| Cutoff | Accuracy | Sensitivity (ID Real) | Specificity (ID Fake) | Best For⦠|
|---|---|---|---|---|
| 0.3 | 0.70 | 0.978 | 0.283 | Social media platform. High sensitivity catches almost all real news, but many fake articles slip through |
| 0.5 | 0.74 | 0.933 | 0.450 | This is a balanced reasonable trade-off between catching fake news and not mislabeling real news |
| 0.7 | 0.633 | 0.522 | 0.800 | Fact-checking websites. Prioritizes catching fake news, but many real articles may be falsely flagged |
Discussion: Which cutoff would you choose if: - You run a social media platform (want to avoid censoring real news)? - Youβre a fact-checking website (want to flag as much fake news as possible)? - Remember: Lowering cutoff = easier to classify as βrealβ = harder to flag as βfakeβ
Social media platform: Use a lower cutoff (0.3) to avoid censoring real news. This will catch nearly all real articles (high sensitivity) even if we miss some fake news.
Fact-checking website: Use a higher cutoff (0.7) to flag as much fake news as possible. Higher specificity ensures fake articles are caught, but some real news might be falsely flagged.
π― CHALLENGE 13: Visualize Model Predictions
Create a visualization showing how predicted probabilities vary with one or more predictors.
Fitted Draws Plot
Use add_fitted_draws() to show the relationship between a predictor and predicted probability:
fake_news %>%
add_fitted_draws(fake_news_model, n = 100) %>%
ggplot(aes(x = title_words, y = type)) +
geom_line(aes(y = .value, group = .draw), alpha = 0.15) +
labs(y = "probability of REAL news")

Note: .value will be between 0 and 1, representing P(Real news)
fake_news %>%
add_fitted_draws(fake_news_model, n = 100) %>% # 100 draws from posterior
ggplot(aes(x = title_words, y = .value)) +
geom_line(aes(group = .draw), alpha = 0.15) + # Draw multiple lines for uncertainty
geom_smooth(aes(y = .value), method = "loess", color = "blue") + # Optional smooth summary
labs(
x = "Title Words",
y = "Predicted Probability of REAL News",
title = "Predicted Probability of REAL News vs. Title Length"
) +
theme_minimal()

What does this plot tell you about: 1. The relationship between title length and the probability of real news? 2. How does this relationship differ for articles with/without exclamation points? 3. Where is the model most/least certain? 4. Does the visualization support your odds ratio interpretations?
1. As title length increases, the predicted probability of REAL news decreases.
Shorter titles are more likely to be real, while longer titles are more likely to be fake.
This aligns with the odds ratio for title_words (OR < 1), meaning each additional word reduces the odds of being real news.
2. Articles with exclamation points would generally have a lower predicted probability of being real, based on the earlier odds ratio (roughly 0.03 to 0.21).
If we split the plot by title_has_excl, we would see a downward shift for titles with !, meaning they are more likely to be fake at any given title length.
3. The spread of the lines shows posterior uncertainty. Tightly clustered lines mean the model is more certain;
widely spread lines mean it is less certain.
This makes sense, since fewer observations at extreme title lengths lead to more uncertainty.
4. Yes, the visualization supports the odds ratios.
Longer titles = lower probability of REAL news.
Exclamation points are strongly associated with FAKE news with a very low probability of being real.
Negative sentiment would similarly show a downward trend if plotted.
π Part B Reflection Questions
Feature Importance: Which predictor(s) were most strongly associated with fake news? (Hint: look for OR farthest from 1.0) How can you tell from the odds ratios?
Title length: Each additional word in the title is associated with 4 to 17% lower odds of being REAL news, so longer titles are slightly more likely to be FAKE news.
Negative sentiment: Higher negative sentiment is associated with 12 to 40% lower odds of being REAL news, so articles with more negative language are more likely to be FAKE news.
Exclamation points: Titles with exclamation points are much more likely to be FAKE news (the strongest predictor, with the odds ratio farthest from 1).
Classification Trade-offs: In the context of fake news detection, is it worse to:
- Flag real news as fake (false negative - blocks legitimate information)?
- Miss fake news and label it as real (false positive - spreads misinformation)?
- How should this influence your cutoff choice?
In fake news detection, many argue that missing fake news (letting it be labeled real) is worse, because letting fake news spread can be more damaging than temporarily flagging some real articles.
If catching fake news is the priority, we would raise the cutoff so that more articles are flagged as potentially fake, accepting that some real news might be flagged.
Model Limitations: What information is this model missing? What other predictors might be useful (e.g., source, author, date)?
Source, publisher credibility
Author history or reputation
Publication date, recency
Social engagement metrics like shares and comments
Causality Warning: Can you say that exclamation points cause an article to be fake? Why or why not? Whatβs the difference between association and causation?
No because the model shows association, not causation.
Association is that articles with exclamation points are more likely to be fake.
Causation would be that adding exclamation points would make an article fake.
Real-world Application: How would you deploy this model in practice? What additional validation would you need before using it to flag articles?
We could integrate the model into a news aggregator or social media platform to flag potentially fake articles.
Additional validation would include testing on out-of-sample or newer datasets.
Evaluate precision and recall to balance false positives vs. false negatives.
Possibly use human review for flagged articles to avoid mistakes.
Monitor over time for concept drift since language patterns in fake news can change.
π Final Submission Checklist
π€ How to Submit
Step 1: Render to HTML
Click the blue βRenderβ button in RStudio and wait for completion.
Step 2: Open in Browser
Click the βShow in new windowβ icon in the Viewer pane.
Step 3: Save as PDF
Right-click to print, then save as PDF. Upload the PDF to Canvas.